[plan] Extract ResourceCache and PythonResourceBridge from AgentPlan #548

Open

weiqingy wants to merge 3 commits into apache:main from weiqingy:issue_547

Conversation

@weiqingy
Collaborator

Linked issue: #547

Purpose of change

AgentPlan (624 lines) mixes plan definition, resource caching/resolution, Python bridge wiring, and serialization.
This PR extracts two classes to separate concerns:

  • ResourceCache — lazy resource resolution, caching, and cleanup. Created by the operator in open(), owned by
    the operator lifecycle.
  • PythonResourceBridge — static discoverPythonMCPResources() for Python MCP tool/prompt discovery. Called
    during operator init.

After extraction, AgentPlan becomes immutable after construction (~490 lines, down from 624). The removed public
methods are getResource(), close(), and setPythonResourceAdapter().
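
The extracted cache can be pictured roughly like this — a minimal sketch only, with stand-in definitions for ResourceType, ResourceProvider, and the resource values (the real types live in the plan module; only the method shapes named in this PR are mirrored):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins; the real ResourceType/ResourceProvider are plan-module types.
enum ResourceType { CONNECTION, PROMPT, TOOL }

interface ResourceProvider {
    Object provide(String name);
}

// Minimal sketch of ResourceCache: lazy resolution, caching, and cleanup.
class ResourceCache implements AutoCloseable {

    private final Map<ResourceType, Map<String, ResourceProvider>> resourceProviders;
    private final Map<ResourceType, Map<String, Object>> cache = new HashMap<>();

    ResourceCache(Map<ResourceType, Map<String, ResourceProvider>> resourceProviders) {
        this.resourceProviders = resourceProviders;
    }

    // Lazy: the provider runs on first access; later lookups hit the cache.
    Object getResource(String name, ResourceType type) {
        return cache
                .computeIfAbsent(type, t -> new HashMap<>())
                .computeIfAbsent(name, n -> resourceProviders.get(type).get(n).provide(n));
    }

    // Cleanup hook, invoked when the owning operator closes.
    @Override
    public void close() {
        cache.clear();
    }
}
```

Because the cache is created in the operator's open() and closed with it, its lifetime never outlives a single operator subtask.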

Tests

  • mvn test -pl plan — all plan module tests pass
  • mvn test -pl runtime — all runtime module tests pass
  • ./tools/lint.sh -c — formatting check passed
  • ./tools/ut.sh -j — full Java test suite passed

API

Yes. Three public methods removed from AgentPlan:

  • getResource(String, ResourceType) — replaced by ResourceCache.getResource()
  • close() — replaced by ResourceCache.close()
  • setPythonResourceAdapter(PythonResourceAdapter) — replaced by PythonResourceBridge.discoverPythonMCPResources()
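
A hypothetical before/after for callers, showing the ownership move. OperatorSketch and the String-keyed cache are illustrative stand-ins, not the PR's real classes; the real cache is additionally keyed by ResourceType:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical operator showing the new ownership model: the cache is created
// in open() and released in close(), replacing the removed AgentPlan methods.
class OperatorSketch {

    private final Map<String, Function<String, Object>> providers;
    private Map<String, Object> cache;  // stand-in for ResourceCache

    OperatorSketch(Map<String, Function<String, Object>> providers) {
        this.providers = providers;
    }

    void open() {
        cache = new HashMap<>();  // was: resolution state wired inside AgentPlan
    }

    Object resolve(String name) {  // was: plan.getResource(name, type)
        return cache.computeIfAbsent(name, n -> providers.get(n).apply(n));
    }

    void close() {  // was: plan.close()
        cache.clear();
        cache = null;
    }
}
```

AgentPlan itself no longer carries any of this mutable state, which is what makes it immutable after construction.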

Documentation

  • doc-needed
  • doc-not-needed
  • doc-included

@github-actions github-actions bot added the priority/major (Default priority of the PR or issue.), fixVersion/0.3.0 (The feature or bug should be implemented/fixed in the 0.3.0 version.), and doc-not-needed (Your PR changes do not impact docs) labels on Feb 22, 2026
@weiqingy
Collaborator Author

Both CI failures are unrelated to our changes:

  1. it-python [flink-2.2] — Ollama download returned HTTP 404. This is a transient infrastructure issue — the Ollama release artifact was temporarily unavailable. Nothing to fix on our side.
  2. cross-language — Py4JError / Py4JNetworkError in the cross-language e2e tests. The Java gateway crashed during test_java_chat_model_integration and subsequent tests couldn't connect. Our PR only touches plan/ and runtime/ Java code — it doesn't affect the Python-Java bridge or cross-language e2e test infrastructure.

Collaborator

@wenjin272 wenjin272 left a comment


Thanks for your work, @weiqingy. I think the separation makes sense.

Besides, I think we should also apply this separation on the Python side. If possible, could you implement it in this PR?

public class ResourceCache implements AutoCloseable {

    private final Map<ResourceType, Map<String, ResourceProvider>> resourceProviders;
    private final Map<ResourceType, Map<String, Resource>> cache = new HashMap<>();
Collaborator


Should we use ConcurrentHashMap here? Tasks submitted by ctx.durableExecuteAsync may read/write this HashMap in parallel.

Collaborator Author


Good question! This is intentional. After the refactoring, ResourceCache is created and owned by ActionExecutionOperator.open(), so it's scoped to a single operator subtask. In Flink's execution model, all access (processElement, open, close) runs on the same mailbox thread. durableExecuteAsync dispatches work to external threads, but resource resolution from the cache happens on the operator thread before dispatch. So HashMap is sufficient and avoids the overhead of ConcurrentHashMap. Let me know if this matches your understanding.

Collaborator

@wenjin272 wenjin272 Mar 2, 2026


but resource resolution from the cache happens on the operator thread before dispatch

I think the resource resolution may not always happen before dispatch.

Take the built-in chat action as an example. In the chat action, we submit the chat task to external threads, and that task calls the chat method of BaseChatModelSetup, which runs on an asynchronous thread. Inside chat, the chat model setup resolves the corresponding connection, prompt, and tools, which I think may lead to concurrent access to the cache.

Collaborator Author


You're right, thanks for pushing back on this. I traced the code path more carefully:

ChatModelAction → durableExecuteAsync(callable) → async pool thread runs chatModel.chat() → BaseChatModelSetup.chat() calls this.getResource.apply() to resolve connection, prompt, and tools → which hits ResourceCache.getResource().

So resource resolution does happen on async threads, not just the mailbox thread. I'll switch back to ConcurrentHashMap. Good catch!
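
The switch discussed here can be sketched as follows (names are illustrative, not the PR's code; the point is that ConcurrentHashMap.computeIfAbsent is atomic per key, so a resolver runs at most once per resource even when async pool threads race on it):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the fix: a ConcurrentHashMap-backed cache that threads dispatched
// via durableExecuteAsync can read and write safely.
class ConcurrentResourceCache {

    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Atomic per key: concurrent callers for the same name see one resolution.
    Object getResource(String name, Function<String, Object> resolver) {
        return cache.computeIfAbsent(name, resolver);
    }
}
```

One caveat with ConcurrentHashMap.computeIfAbsent: the mapping function must not attempt to update the same map (recursive updates can throw or block), so a resolver that in turn resolves other resources should do so outside the compute call.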

Collaborator Author


Updated the PR.

@weiqingy
Collaborator Author

weiqingy commented Mar 2, 2026

Thanks for the review, @wenjin272! I've updated the PR to apply the same separation on the Python side as well. Could you please take another look?

Collaborator

@wenjin272 wenjin272 left a comment


LGTM. Could you take a look at your convenience, @xintongsong?

@weiqingy
Collaborator Author

weiqingy commented Mar 2, 2026

I checked the CI failures: both are LLM-dependent e2e tests and don't appear to be caused by this PR.

Test 1 (react_agent_test): The output 4444 = 2123 + 2321 proves our ResourceCache IS working correctly — the chat model was resolved, the add tool was resolved and called successfully. The LLM (qwen3:1.7b) simply stopped after one tool call instead of continuing to call multiply(4444, 312). This is LLM non-determinism.

Test 2 (long_term_memory_test): This runs on the Flink remote runner, where there's exactly ONE FlinkRunnerContext with ONE ResourceCache. The behavior is identical to before. The failure is assert len(doc) == 1 after LLM-based compaction using qwen3:8b — if the model's summarization response is malformed, compaction produces incorrect output.

We can re-run CI to confirm flakiness — if it fails again with different assertion values, that would further support LLM non-determinism.

@wenjin272 do you have access to re-run the CI tests? It looks like admin rights are required.

@wenjin272
Collaborator

Hi @weiqingy, sorry, I don't have access to re-run the CI. I acknowledge that the failing test is flaky on its own and not caused by this PR.
